Assessing profiles of various loan borrowers by Varsha Shewale

The original dataset has about 81 variables. Out of these I am analyzing only the below mentioned 12(now 13) variables which I am more interested in viz. “ListingNumber”,“Term”,“LoanStatus”, “Occupation”,“EmploymentStatus”,“CreditScoreRangeLower”,“CreditScoreRangeUpper”, “DelinquenciesLast7Years”,“StatedMonthlyIncome”,“DebtToIncomeRatio”, “IncomeRange”,“LoanOriginalAmount”,“ListingCategory”

Further I will be decoding the values in the ListingCategory with category names Reference: https://stackoverflow.com/questions/25432935/how-to-replace-values-with-table-map-in-r

The subsetted dataset has 113,937 observations of 12 variables (now 13).

Univariate Analysis

Below is the summary of the loan data subset

##  ListingNumber          Term                       LoanStatus   
##  Min.   :      4   Min.   :12.00   Current              :56576  
##  1st Qu.: 400919   1st Qu.:36.00   Completed            :38074  
##  Median : 600554   Median :36.00   Chargedoff           :11992  
##  Mean   : 627886   Mean   :40.83   Defaulted            : 5018  
##  3rd Qu.: 892634   3rd Qu.:36.00   Past Due (1-15 days) :  806  
##  Max.   :1255725   Max.   :60.00   Past Due (31-60 days):  363  
##                                    (Other)              : 1108  
##                     Occupation         EmploymentStatus
##  Other                   :28617   Employed     :67322  
##  Professional            :13628   Full-time    :26355  
##  Computer Programmer     : 4478   Self-employed: 6134  
##  Executive               : 4311   Not available: 5347  
##  Teacher                 : 3759   Other        : 3806  
##  Administrative Assistant: 3688                : 2255  
##  (Other)                 :55456   (Other)      : 2718  
##  CreditScoreRangeLower CreditScoreRangeUpper DelinquenciesLast7Years
##  Min.   :  0.0         Min.   : 19.0         Min.   : 0.000         
##  1st Qu.:660.0         1st Qu.:679.0         1st Qu.: 0.000         
##  Median :680.0         Median :699.0         Median : 0.000         
##  Mean   :685.6         Mean   :704.6         Mean   : 4.155         
##  3rd Qu.:720.0         3rd Qu.:739.0         3rd Qu.: 3.000         
##  Max.   :880.0         Max.   :899.0         Max.   :99.000         
##  NA's   :591           NA's   :591           NA's   :990            
##  StatedMonthlyIncome DebtToIncomeRatio         IncomeRange   
##  Min.   :      0     Min.   : 0.000    $25,000-49,999:32192  
##  1st Qu.:   3200     1st Qu.: 0.140    $50,000-74,999:31050  
##  Median :   4667     Median : 0.220    $100,000+     :17337  
##  Mean   :   5608     Mean   : 0.276    $75,000-99,999:16916  
##  3rd Qu.:   6825     3rd Qu.: 0.320    Not displayed : 7741  
##  Max.   :1750003     Max.   :10.010    $1-24,999     : 7274  
##                      NA's   :8554      (Other)       : 1427  
##  LoanOriginalAmount           ListingCategory 
##  Min.   : 1000      Debt Consolidation:58308  
##  1st Qu.: 4000      Not Available     :16965  
##  Median : 6500       Other            :10494  
##  Mean   : 8337      Home Improvement  : 7433  
##  3rd Qu.:12000      Business          : 7189  
##  Max.   :35000      Auto              : 2572  
##                     (Other)           :10976

Further I wish to analyze LoanStatus. Looking at the LoanStatus as a percentage plot

Looking at the LoanStatus using a log scale

Around 10% were charged-off, 5% defaulted on their loan, approx. 2% are past their due payment date. Around 33% have completed paying off their loan. Around 50% have ‘current’ loan status.

Let’s look at the loan term.

Approx. 21% borrowers have a loan term of 60 months, 78% borrowers have a term of 36 months; and approx. 2% have a term of 12 months.

Let’s look at the Employment Status with percent defaulting/chargedoff.

60% of Loan borrowers are Employed, 24%(approx.) are fulltime employees,approx. 5% are self employed, about 1% are unemployed, approx. 5% do not have any employment status available( This is definitely not good as employment status is key in determining whether the borrower is capable of paying off his loan) Right now it is not clear if a specific employment category is defaulting/chargedoff more than the others.

Let’s look at the Credit Scores. For Analyzing credit scores, I have created a CreditScoreMean column which is average of CreditScoreRangeLower and CreditScoreRangeUpper.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0   670.0   690.0   695.6   730.0   890.0     591

CreditScoreMean has an IQR = 60, 50% of the borrowers have CreditScoreMean between 670 and 730, maximum CreditScoreMean is 890, with minimum being at 10

Let’s look at the delinquencies for last 7 years.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990

Around 80000 (70%) borrowers have zero delinquencies in the past 7 years. Further subsetting borrowers with one or more delinquencies.

Approx. 32% borrowers were delinquent in the past 7 years.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

Considering income, mean income is $5608 and median income is $4667. Some outliers are driving the mean higher.

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

As anticipated the income distribution is right skewed following the 80-20 rule with very few people having very high incomes and vice versa.

Let’s look at LoanOriginalAmount.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

Considering the loan amounts borrowed, based on the boxplot 75% of the borrowers have loan amount <= $12000 and the maximum loan amount is $35000.

## [1] 26909

Out of 113,937 borrowers, 26909 (25%) have loan amount > $12000

Summarizing findings of Univariate Analysis

  1. Out of 113,937 borrowers, 26909 (25%) have loan amount > 12000, median loan amount is equal to $6500, maximum loan amount is $35000 with minimum loan amount = $1000
  2. 75% of borrowers have monthly income <=$6825 with maximum monthly income at $1,750,000. This caused an extremely right-skewed distribution.
  3. 60% of Loan borrowers are Employed, 24%(approx.) are fulltime employees, approx. 5% are self employed, about 1% are unemployed, approx. 5% don’t have any employment status available( This is definitely not good as employment status is key in determining whether the borrower is capable of paying off his loan)
  4. Majority of the borrowers (approx. 78%) have a term of 36 months
  5. 33% completed their loan, around 10% were charged-off, 5% defaulted
  6. CreditScoreMean is a calculated column using the average of lowerCreditScore range and upperCreditScore range for each borrower CreditScoreMean has an IQR = 60, 50% of the borrowers have CreditScoreMean between 670 and 730, maximum CreditScoreMean is 890, with minimum being at 10 (would definitely like to investigate the borrowers with lower CreditScoreMean)
  7. Approx. 32% borrowers have 1 or more delinquencies in past 7 years.

Considering the above findings, I further wish to explore: - borrowers with low credit scores and/or unemployed. - profiles of people who were chargedoff and defaulted. - profiles of people who completed their loans. - if past deliquencies have any effect on today’s loan status.

Bivariate Analysis (part 1)

NOTE: BASED ON SUBSETS DRAWN IN FURTHER ANALYSIS, SOMETIMES I FELT THE NEED TO CONDUCT UNIVARIATE ANALYSIS TO LOOK AT THE DISTRIBUTION OF A KEY VARIABLE WITHIN
THE SUBSET ALONE.

Let’s analyze creditScoreMean vs. loan Amount

Loan Amount vs. CreditScoreMean yeilds a left skewed distribution. We can see that with an increase in the creditScoreMean borrowers are able to avail higher amount loans as expected. The surprising aspect of the plot is loans lent to borrowers with low credit scores (almost close to 10).

Let’s find the coefficient of corelation between Loan Amount and CreditScoreMean

## [1] 0.3408745

The correlation coefficient is 0.3408, which means the relation is positive and meaningful but not very strong.

Lets analyze these low credit score borrowers.

Out of the 133 low credit score borrowers, approx. - 29% completed their loan, - 26% were chargedoff, - 44% defaulted

Let’s create same plot for high credit score borrowers

Out of 113213 high credit score borrowers, - 33% completed their loans, - 10% were chargedoff, - 4% defaulted,

Clearly low credit score is a key attribute in determining future loan payment status as the percentage of defaulters/chargedoff is much higher with low credit score borrowers.

Let’s subset the low credit score borrowers to keep only defaulted and chargedoff cases.

Lets explore their occupation and employment status.

Out of the 133 lower credit score cases,where 94 are defaulting or chargedoff cases, no data is available regarding their EmploymentStatus. It is either stated as ‘Not available’ or is missing. Let’s check their occupation.

##  [1] Clerical                    Other                      
##  [3]                             Truck Driver               
##  [5] Sales - Commission          Administrative Assistant   
##  [7] Laborer                     Food Service               
##  [9] Homemaker                   Waiter/Waitress            
## [11] Skilled Labor               Sales - Retail             
## [13] Professional                Student - College Sophomore
## [15] Tradesman - Mechanic        Engineer - Mechanical      
## [17] Nurse (RN)                  Retail Management          
## [19] Food Service Management     Scientist                  
## [21] Computer Programmer         Clergy                     
## [23] Student - College Junior   
## 68 Levels:  Accountant/CPA Administrative Assistant Analyst ... Waiter/Waitress

The defaulters/chargeoff borrowers with lower credit score mean seem to have myriad occupations. Lets explore their monthly income

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1192    1949    2507    2805   33333

IQR indicates 50% of low credit defaulters/chargedoff have income in range $1200 to $2800 approx. It is surprising to see someone with monthly income of $33330 with credit score mean <= 10 and being in the defaulting/chargedoff category.

The above borrower had taken a loan =$3600 for 36 months term and he defaulted on his loan. There is no other data available for this specific defaulter.

Moving on to our other borrowers in the above subset “loan.low_credit_defChar” and comparing them to the rest of the borrowers.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.083 1041.667 1916.667 2204.233 2465.875 9233.333

Lets do a side by side comparison of Incomes between low credit defaulters/Chargedoff and other low credit borrowers

## No id variables; using all as measure variables

As seen from the side by side boxplots, the income distribution between low credit defaulters/chargedoff and low credit others is NOT much different. Infact the numbers seen in summary stats for low credit non defaulters non chargedoff are lower looking at the IQR and mean!

Bivariate Analysis (part 2)

Now let’s explore relation between creditScoreMean and deliquencies

The graph of the CreditScore Vs. Deliquencies does not show a straight-forward relation. I was hoping to see a negative corelation as higher number of deliquencies should adversely affect the credit mean but even with as high as 100 deliquencies the credit score mean is in 500-750 range approx.

Let’s explore StatedMonthlyIncome vs. deliquencies

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

This graph is also same as above, with no straight-forward relation between income and deliquencies. Borrowers with small as well as large incomes have varying numbers of deliquencies.

Now let’s look into delinquencies vs. debt-to-income ratio

In general there is negative corelation here, the lower the Debt-to-Income ratio the higher are the deliquencies. The number of past deliquencies seen is very high with debt-to-income ratio of around 0.5 The correlation coefficient is:

## [1] -0.04387671

-0.04387671. So the correlation is negative but not meaningful as it’s absolute value < 0.3.

Let’s see if there is any correlation between loan amount and past deliquencies.

In general there is a negative corelation between loan amount and delinquencies. Borrowers with large number of past delinquencies have taken smaller loan amounts.

The correlation coefficient is:

## [1] -0.1366667

-0.1366667. So the correlation is negative but not meaningful as it’s absolute value < 0.3

Part Bivariate Analysis (part 3)

Let’s see if past delinquencies show any impact on CreditScoresMean.

Let’s explore past delinquencies versus loan status.

Looking at the past delinquencies its difficult to predict future loan’s payment status. I was hoping to see fewer past deliquencies for Loan Status = ‘Completed’.

Among borrowers with past delinquencies, -around 14% were chargedoff, -around 5% defaulted, -around 33% completed their loan.

Lets try to see same plot for borrowers with zero past deliquencies

67% of borrowers have zero past delinquencies (for last 7 yrs).

Among borrowers with zero past delinquencies, -around 9% were chargedoff, -around 4% defaulted, -around 33% completed their loan.

Comparing borrowers with past deliquencies vs those with zero deliquencies, we see that ones with past delinquencies have higher percentages of chargedoff/ defaulting loans.

Bivariate Analysis (part 4)

Now let’s look at people who availed loan and are unemployed.

Borrowers with all types of ‘Employment Status’ had either defaulted/chargedoff. Want to see the profile of the unemployed borrowers

## 'data.frame':    835 obs. of  14 variables:
##  $ ListingNumber          : int  122819 704916 506130 578652 556219 570039 489339 639315 487128 606763 ...
##  $ Term                   : int  36 36 36 36 36 36 36 12 36 36 ...
##  $ LoanStatus             : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 2 4 3 2 3 3 3 3 4 ...
##  $ Occupation             : Factor w/ 68 levels "","Accountant/CPA",..: 37 37 37 37 37 37 37 37 37 37 ...
##  $ EmploymentStatus       : Factor w/ 9 levels "","Employed",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ CreditScoreRangeLower  : int  620 720 640 760 660 760 720 660 700 720 ...
##  $ CreditScoreRangeUpper  : int  639 739 659 779 679 779 739 679 719 739 ...
##  $ DelinquenciesLast7Years: int  12 0 0 0 83 0 0 12 0 0 ...
##  $ StatedMonthlyIncome    : num  860 0 0 0 0 0 0 0 0 0 ...
##  $ DebtToIncomeRatio      : num  0.2 NA NA NA NA NA NA NA NA NA ...
##  $ IncomeRange            : Factor w/ 8 levels "$0","$1-24,999",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ LoanOriginalAmount     : int  1000 4000 5000 10000 4000 2000 2500 2000 1000 15000 ...
##  $ ListingCategory        : Factor w/ 21 levels " Other","Auto",..: 15 11 1 1 10 1 1 1 1 5 ...
##  $ CreditScoreMean        : num  630 730 650 770 670 770 730 670 710 730 ...

Among the unemployed borrowers, 23% are chargedoff loans, 3% are defaulting loans, 41% are completed loans

Let’s compare above with employed borrowers (Employed, full-time, Other, part-time, self employed)

Around 92% are employed borrowers based on our definition of ‘employed’ as given in the vector above. Among the employed borrowers, - 10% are chargedoff loans, - 3% are defaulting loans, - 31% are completed loans

Comparing employed and unemployed borrowers, the percentage of loans getting chargedoff is very high with unemployed borrowers.

Lets look at occupations of unemployed borrowers.

##  [1] Other                              Student - College Graduate Student
##  [3] Sales - Commission                 Student - Community College       
##  [5] Psychologist                       Student - College Senior          
##  [7] Student - College Junior           Professional                      
##  [9] Student - College Sophomore        Analyst                           
## [11] Teacher's Aide                     Retail Management                 
## [13] Homemaker                          Sales - Retail                    
## [15] Nurse's Aide                       Waiter/Waitress                   
## [17] Student - Technical School         Student - College Freshman        
## [19] Skilled Labor                     
## 68 Levels:  Accountant/CPA Administrative Assistant Analyst ... Waiter/Waitress

Approx. 800 of 835 unemployed have listed ‘other’ as their occupation while the majority of the remaining appear to be students. This finding was something unexpected for me as I was anticipating the unemployed borrowers to be mostly students. Below I am recreating above plot excluding ‘Other’ occupation.

Lets analyze the loan amounts that were taken by the unemployed borrowers

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4000    4873    6000   25000

50% of unemployed borrowers availed loan amounts between $2500 and $6000. Median being $4000 and maximum being $25000.

Digressing a bit to see how much loan do students borrow in general.

Out of 113937 borrowers, 699 are students. Checking their employment status

It was fascinating to find only about 3% students unemployed. Rest were either working part-time, full-time, self-employed, etc.

Lets look at the loan amount borrowed by students along with loan status

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2000    3000    3764    5000   20000

Its a right skewed distribution. 50% of students borrowed loan between $2000 and $5000, with maximum loan at $20000. I am wondering if all these loans were for education or were personal, etc. Need to see if the whole dataset had a Loan_type variable. The dataset actually has a variable ‘ListingCategory’ which mentions type of loan. Adding this variable to our dataframe by changing code in the beginning. Decoding the values in the ListingCategory with category names. Reference: https://stackoverflow.com/questions/25432935/how-to-replace-values-with-table- map-in-r And then viewing the barchart for ListingCategory

It is interesting to see that around 51% borrowers have borrowed money to pay off their other loans (based on ListingCategory : Debt Consolidation) This makes me want to digress a bit more and analyze these individuals more in terms of LoanStatus

Based on the above barchart, approx. 22% of debt consolidators have completed their loan, 6% chargedoff, 2% defaulted, whereas about 67% have status listed as ‘current’ which makes me think if all Debt Consolidation related loans are pretty recent compared to other loans.

Moving my focus back to students… Redrawing the ListingCategory barchart

Based on the barchart around 1% (approx. 1000) of total borrowers have taken loan for student_use. Now creating similar barchart just for subsetted student dataframe

So out of approx. 1140 borrowers with occupation listed as students less than 100 have ListingCategory as ‘student_use’. Approx. 130 (18%) have listed ‘Debt Consolidation’ and for 270 students the data is unavailable. It is surprising to see almost 18% students taking loans for paying off their old loans.

Moving back to unemployed borrowers… Filtering unemployed borrowers with low credit scores.

The above result set is empty so none of the unemployed borrowers had low credit scores.

Lets analyze the type of loan (ListingCategory) taken by unemployed folks

Based on the above graph appox. 33% (270) of the total unemployed (835) have ‘Debt Consolidation’ as their ListingCategory.

18% have specified ‘other’,12% have not provided this info, 11% have listed ‘Business’, 7% have listed ‘household expenses’ and all other Listing categories are below 5% each.

Taking help from http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to- analyze-format-and-visualize-a-correlation-matrix-using-r-software and http://jamesmarquezportfolio.com/correlation_matrices_in_r.html to visualize the correlation matrix.

The above plot displays positive correlation in blue and negative correlations in red. The size of the circle displays the strength of the relation and the blank cells indicate insignificant relations. The significance level used is 0.01 The only meaningful relations are between loan original amount and creditScoreMean; and loan original amount and term.

Summarizing findings of Bivariate Analysis

Loan Amount vs. CreditScoreMean yeilds a left skewed distribution. With an increase in the creditScoreMean borrowers are able to avail higher amount loans as expected. The surprising aspect of the plot is loans lent to borrowers with low credit scores (almost close to 10).

The correlation coefficient is 0.3408, which means the relation is positive and meaningful but not very strong.

Percentage of defaulters/chargedoff is much higher with low credit score borrowers.

The lower the Debt-to-Income ratio, the higher are the delinquencies (coef = - 0.04387671). So the correlation is negative but not meaningful as it’s absolute value < 0.3.

Borrowers with large number of past delinquencies have taken smaller loan amounts. (coeff = -0.1366667.) So the correlation is negative between loan amount and delinquencies but not meaningful as it’s absolute value < 0.3

Borrowers with one or more past deliquencies have higher percentages of chargedoff/defaulting loans vs zero deliquencies.

Comparing employed and unemployed borrowers, the percentage of loans getting chargedoff is very high with unemployed borrowers.

50% of unemployed borrowers availed loan amounts between $2500 and $6000. Median being $4000 and maximum being $25000.

Out of 113937 borrowers, 699 are students. It was fascinating to find only about 3% students unemployed. Rest were either working part-time, full-time, self-employed, etc.

It is interesting to see that around 51% borrowers have borrowed money to pay off their other loans (based on ListingCategory : Debt Consolidation)

Based on their LoanStatus approx. 22% of debt consolidators have completed their loan, 6% chargedoff, 2% defaulted, whereas about 67% have status listed as ‘current’ which makes me think if all Debt Consolidation related loans are pretty recent compared to other loans.

None of the unemployed borrowers had low credit scores.

Based on the above graph appox. 33% (270) of the total unemployed (835) have ‘Debt Consolidation’ as their ListingCategory. 18% have specified ‘other’,12% have not provided this info, 11% have listed ‘Business’, 7% have listed ‘household expenses’ and all other Listing categories are below 5% each.

The only meaningful relations are between loan original amount and creditScoreMean; and loan original amount and term.

Multivariate Analysis

Let’s see the same plots above using boxplots and removing creditScores < 10 to correctly see the creditScore mean spread.

We see that borrowers with higher credit scores (between 600 and 900) have terms of 12 months and 60 months, while lower credit borrowers have term of 36 months. Also with increase in term the loan amount is also increasing.

Let’s see if we can incorporate loan status in our above geom_point plot. Also I wish to focus only 3 levels of loan status viz. Completed, defaulted and chargedoff. Subsetting loan dataframe to include rows with only “Chargedoff”,“Completed” and “Defaulted” cases. We will look at only complete cases.

Using [https://stackoverflow.com/questions/3472980/ggplot-how-to-change-facet-labels] I was able to add proper names and units for faceted variable ‘Term’.

## `geom_smooth()` using method = 'gam'

From the plot above we see that: Defaulting/ Charged-off loan amounts < = $25,000. The defaulters with 36-months term have credit scores in range 50-430. The chargedoff loans with 36-months term have credit scores in range 440-830. Completion of loan is 100% for loan amounts > $25,000

Final Plots and Summary

Plot One

Loan Amount Vs. Credit Score (Mean)

This plot gives us a good idea of how the loan amounts vary based on Credit Scores. The higher credit scores are availing higher loan amounts, but some borrowers with almost 10 credit score were also able to avail loans; may be they had not built a credit history yet as they may be immigrants new to the country, or never used credit card before, etc.

Plot Two

Loan Original Amount vs. Loan Status

In general we see higher completion rate for higher loan amounts, the defaulting loans as well as Chargedoff loans are not more than $25000.

Plot Three

## `geom_smooth()` using method = 'gam'

Loan Amount Vs. Credit Score (Mean) facetted by Term, colored by LoanStatus

This plot brings all the key findings together. For 12 months term the variability is the lowest for credit score means and LoanAmounts too. The defaulting and chargedoff loans are least for 12 months term.

For 36 months term the borrowers with chargedoff loans are surprisingly in the higher credit score bracket 450-850 approx. The credit score variability is highest for 36 months term.

For 60 months term the spread is highest for Loan Amounts, while credit scores lie between 600 and 850.


Reflection

I went through a bit of struggle when trying to predict categorical variable ‘LoanStatus’ using quantitative independent variables.

During all my exploration I got very few meaningful and zero strong relations between the variables chosen. This was surprising for me since I chose these variables thinking that they are the key factors that can determine loan payment status for a borrower.

I believe going forward I would like to explore a few other variables to see if they are better predictors of loan payment status.